1 Introduction

ARDJ (Acceptability Rating Data for Japanese) is an ongoing research project that began in 2016. Its aim is to lay the foundations for Evidence-based Linguistics (EBL), echoing the perspective of Evidence-based Medicine (EBM), which implements the idea of a hierarchy of effective evidence and gives top priority to randomized controlled trials (RCTs).

In 2019, ARDJ yielded its first dataset, called “Survey 2 Unified”, based on a large-scale, web-based acquisition of four-point acceptability ratings for 468 Japanese sentences.

In 2020, ARDJ yielded another dataset, reported here, which we call “s1-s2 RT data”. It is a compilation of reaction-time (RT) data obtained from three groups of college students in Hakodate, Tokyo and Gifu. In this report, we describe its structure and present a few analyses, namely PCA combined with unsupervised clustering (X-means or FuzzyCMeans).

2 Setups

2.1 Setting up parameters for analysis

### Parameters for graphics 
old.par <- par(no.readonly = TRUE) # save current settings so they can be restored
## core graphics
mfrow.2x3.val <- c(2,3)
mfrow.2x2.val <- c(2,2)
ja.fn <- "HiraKakuPro-W3"
rm.fn <- "Lucida Sans Unicode"
#par(family = eval(ja.fn), mar = c(5,5,5,5), xpd = T, cex = 0.6)
knitr::opts_chunk$set(eval = TRUE, echo = FALSE,
                      fig.height = 5, fig.width = 6)
## graphics parameters are set separately; par() is not a chunk option
par(family = rm.fn, mar = c(4,5,4,5), xpd = TRUE, cex = 0.6)
### color palette
n.cols <- 11 # This is the maximum
require(RColorBrewer)
## Loading required package: RColorBrewer
my.cols <- rev(brewer.pal(n.cols, "RdYlBu"))
## matplot
ylim.val <- c(0,9)
## violin plot
vio.ylim.val <- c(0,10)
vio.col.val <- "lightblue"
vio.median_col.val <- "magenta"
vio.box_col.val <- "blue"
vio.box_width.val <- 0.15

2.1.1 Filtering and clustering parameters

2.1.2 Sampling parameters

2.2 Setting up data

2.2.1 Sentence/stimulus data

A sample of the sentences used as stimuli is shown below.

## Loading required package: readxl

2.2.2 Raw RT data

The RT data we obtained are sampled below.

## Warning: NAs introduced by coercion
## tibble [5,678 × 12] (S3: tbl_df/tbl/data.frame)
##  $ rid  : chr [1:5678] "h01" "h01" "h01" "h01" ...
##  $ gr   : chr [1:5678] "0" "0" "0" "0" ...
##  $ sid  : chr [1:5678] "s2-161" "s1-054" "s2-221" "s1-122" ...
##  $ resp : num [1:5678] 1 2 1 2 1 1 1 2 1 1 ...
##  $ RT   : num [1:5678] 2.211 0.896 1.492 1.812 2.084 ...
##  $ rt1  : num [1:5678] 1.05 0.183 0.85 1 1.217 ...
##  $ rt2  : num [1:5678] 1 1.63 1.73 1 1.03 ...
##  $ rt3  : num [1:5678] 1.15 1.17 1.27 0.95 1.03 ...
##  $ rt4  : num [1:5678] 3.38 1.92 2.89 3.01 3.3 ...
##  $ rt5  : num [1:5678] NA 0.896 NA NA NA ...
##  $ sane : num [1:5678] 1 1 1 1 1 1 1 1 1 1 ...
##  $ place: chr [1:5678] "Hakodate" "Hakodate" "Hakodate" "Hakodate" ...

2.2.3 Discarding extra rows

Since the dataset contains rows generated in an incomplete setting (marked by sane = 0), we first exclude them from the analysis below.

## discard incomplete rows (sane == 0) to produce rt.du.raw
rt.du.raw <- subset(rt.dux.raw, sane == 1)
head(rt.du.raw)
str(rt.du.raw)
## tibble [4,902 × 12] (S3: tbl_df/tbl/data.frame)
##  $ rid  : chr [1:4902] "h01" "h01" "h01" "h01" ...
##  $ gr   : chr [1:4902] "0" "0" "0" "0" ...
##  $ sid  : chr [1:4902] "s2-161" "s1-054" "s2-221" "s1-122" ...
##  $ resp : num [1:4902] 1 2 1 2 1 1 1 2 1 1 ...
##  $ RT   : num [1:4902] 2.211 0.896 1.492 1.812 2.084 ...
##  $ rt1  : num [1:4902] 1.05 0.183 0.85 1 1.217 ...
##  $ rt2  : num [1:4902] 1 1.63 1.73 1 1.03 ...
##  $ rt3  : num [1:4902] 1.15 1.17 1.27 0.95 1.03 ...
##  $ rt4  : num [1:4902] 3.38 1.92 2.89 3.01 3.3 ...
##  $ rt5  : num [1:4902] NA 0.896 NA NA NA ...
##  $ sane : num [1:4902] 1 1 1 1 1 1 1 1 1 1 ...
##  $ place: chr [1:4902] "Hakodate" "Hakodate" "Hakodate" "Hakodate" ...

2.3 Outlier removal

We remove outliers before the analysis proper.

2.3.1 Checking data before outlier removal

Here is a description of the data before applying filtering.

##          rt1        rt2        rt3        rt4        rt5        RT        
## breaks   Numeric,62 Numeric,66 Numeric,60 Numeric,35 Numeric,53 Numeric,43
## counts   Integer,61 Integer,65 Integer,59 Integer,34 Integer,52 Integer,42
## density  Numeric,61 Numeric,65 Numeric,59 Numeric,34 Numeric,52 Numeric,42
## mids     Numeric,61 Numeric,65 Numeric,59 Numeric,34 Numeric,52 Numeric,42
## xname    "d"        "d"        "d"        "d"        "d"        "d"       
## equidist TRUE       TRUE       TRUE       TRUE       TRUE       TRUE

2.3.2 Removing negative values

The data contain negative values for rt2. Negative reaction times are theoretically impossible, but they occurred, perhaps because the experiments were misconfigured. The rows containing them are removed first.

## tibble [4,900 × 12] (S3: tbl_df/tbl/data.frame)
##  $ rid  : chr [1:4900] "h01" "h01" "h01" "h01" ...
##  $ gr   : chr [1:4900] "0" "0" "0" "0" ...
##  $ sid  : chr [1:4900] "s2-161" "s1-054" "s2-221" "s1-122" ...
##  $ resp : num [1:4900] 1 2 1 2 1 1 1 2 1 1 ...
##  $ RT   : num [1:4900] 2.211 0.896 1.492 1.812 2.084 ...
##  $ rt1  : num [1:4900] 1.05 0.183 0.85 1 1.217 ...
##  $ rt2  : num [1:4900] 1 1.63 1.73 1 1.03 ...
##  $ rt3  : num [1:4900] 1.15 1.17 1.27 0.95 1.03 ...
##  $ rt4  : num [1:4900] 3.38 1.92 2.89 3.01 3.3 ...
##  $ rt5  : num [1:4900] NA 0.896 NA NA NA ...
##  $ sane : num [1:4900] 1 1 1 1 1 1 1 1 1 1 ...
##  $ place: chr [1:4900] "Hakodate" "Hakodate" "Hakodate" "Hakodate" ...
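
The removal of negative values can be sketched as follows. This is a minimal illustration only, assuming that a row is dropped whenever any of its RT columns is negative; the toy data frame `toy` and the helper `drop.negative.rows` are hypothetical and are not the code that produced the output above.

```r
## Toy stand-in for the RT table (hypothetical values)
toy <- data.frame(
  rid = c("h01", "h01", "h02", "h02"),
  rt1 = c(1.05, 0.18, 0.85, 1.00),
  rt2 = c(1.00, -0.20, 1.73, 1.00),   # one impossible negative value
  rt5 = c(NA, 0.90, NA, NA)
)

## Keep rows in which no RT column is negative.
## ASSUMPTION: NA (e.g. a missing rt5) does not count as negative.
drop.negative.rows <- function(d, cols) {
  has.neg <- apply(d[, cols, drop = FALSE], 1,
                   function(v) any(v < 0, na.rm = TRUE))
  d[!has.neg, , drop = FALSE]
}

toy.clean <- drop.negative.rows(toy, c("rt1", "rt2", "rt5"))
nrow(toy.clean)
```

Treating NA as "not negative" matters here, because most rows have no rt5 value and would otherwise be lost.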

2.3.3 Removing outlier responses

We decided to use standard deviation (SD) filtering after comparing it with filtering based on Mahalanobis distance.

In addition to excluding responses above the upper threshold (sd.ub), the following analysis excludes outlier responses below the lower threshold (sd.lb), a step that was not included in the work presented at JCSS37.

## [1] "Set SD upper bound (sd.ub) to 3"
## [1] "Set SD lower bound (sd.lb) to 0.1"
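
The report gives the two bounds but not the formulas they enter. The sketch below shows one possible reading, and both cutoffs are assumptions: the upper cutoff is taken to be mean + sd.ub * SD, and the lower cutoff is taken to be sd.lb * SD. The helper `sd.keep` and the simulated vector `rt` are hypothetical.

```r
sd.ub <- 3     # upper bound, as reported above
sd.lb <- 0.1   # lower bound, as reported above

## Flag the values to keep.
## ASSUMPTION: upper cutoff = mean + ub * SD, lower cutoff = lb * SD.
## Neither formula is stated in the report.
sd.keep <- function(x, ub = sd.ub, lb = sd.lb) {
  m <- mean(x, na.rm = TRUE)
  s <- sd(x, na.rm = TRUE)
  is.na(x) | (x <= m + ub * s & x >= lb * s)
}

set.seed(1)
## 200 plausible RTs plus two planted outliers (one too fast, one too slow)
rt <- c(rlnorm(200, meanlog = 0, sdlog = 0.3), 0.001, 25)
keep <- sd.keep(rt)
table(keep)
```

Under this reading the filter would be applied column by column (rt1, rt2, ...), and a row would be kept only if all of its values pass.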

2.3.4 Checking data after outlier removal

Here is a description of the data after applying sd filtering.

##          rt1        rt2        rt3        rt4        rt5        RT        
## breaks   Numeric,62 Numeric,49 Numeric,60 Numeric,35 Numeric,53 Numeric,43
## counts   Integer,61 Integer,48 Integer,59 Integer,34 Integer,52 Integer,42
## density  Numeric,61 Numeric,48 Numeric,59 Numeric,34 Numeric,52 Numeric,42
## mids     Numeric,61 Numeric,48 Numeric,59 Numeric,34 Numeric,52 Numeric,42
## xname    "d"        "d"        "d"        "d"        "d"        "d"       
## equidist TRUE       TRUE       TRUE       TRUE       TRUE       TRUE

2.4 Separate rt5-free and rt5-full responses

The data contain both rt5-free rows (rt5 is NA) and rt5-ful rows (rt5 is recorded). For ease of analysis, we separate them into two datasets.

## 'data.frame':    268 obs. of  12 variables:
##  $ rid  : chr  "h01" "h01" "h01" "h01" ...
##  $ gr   : chr  "0" "0" "0" "0" ...
##  $ sid  : chr  "s1-054" "s1-075" "s1-067" "s1-109" ...
##  $ resp : num  2 1 2 2 1 1 1 2 2 2 ...
##  $ RT   : num  0.896 3.571 0.922 0.471 2.866 ...
##  $ rt1  : num  0.183 0.867 0.584 0.483 0.217 ...
##  $ rt2  : num  1.634 0.767 0.517 0.884 0.784 ...
##  $ rt3  : num  1.17 0.8 0.6 1.2 0.55 ...
##  $ rt4  : num  1.918 2.835 2.518 0.934 1 ...
##  $ rt5  : num  0.896 3.571 0.922 0.471 2.866 ...
##  $ sane : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ place: chr  "Hakodate" "Hakodate" "Hakodate" "Hakodate" ...
## 'data.frame':    4632 obs. of  12 variables:
##  $ rid  : chr  "h01" "h01" "h01" "h01" ...
##  $ gr   : chr  "0" "0" "0" "0" ...
##  $ sid  : chr  "s2-161" "s2-221" "s1-122" "s1-045" ...
##  $ resp : num  1 1 2 1 1 1 2 1 1 2 ...
##  $ RT   : num  2.21 1.49 1.81 2.08 4.35 ...
##  $ rt1  : num  1.05 0.85 1 1.22 1.08 ...
##  $ rt2  : num  1 1.73 1 1.03 1.67 ...
##  $ rt3  : num  1.15 1.27 0.95 1.03 1.93 ...
##  $ rt4  : num  3.38 2.89 3.01 3.3 5.55 ...
##  $ rt5  : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ sane : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ place: chr  "Hakodate" "Hakodate" "Hakodate" "Hakodate" ...
## [1] "number of rt5-free sids: 446"
## [1] "number of rt5-ful sids: 32"
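
A minimal sketch of the split, assuming that a row counts as rt5-ful when rt5 is recorded and as rt5-free when rt5 is NA, as the two structures above suggest. The toy data frame and the object names `rt5.ful` and `rt5.free` are hypothetical.

```r
## Toy stand-in for the filtered table (hypothetical values)
toy <- data.frame(
  sid = c("s2-161", "s1-054", "s2-221", "s1-075"),
  rt4 = c(3.38, 1.92, 2.89, 2.84),
  rt5 = c(NA, 0.896, NA, 3.571)
)

## ASSUMPTION: rt5-ful = rt5 recorded, rt5-free = rt5 missing (NA)
rt5.ful  <- toy[!is.na(toy$rt5), ]
rt5.free <- toy[ is.na(toy$rt5), ]

print(paste("number of rt5-free sids:", length(unique(rt5.free$sid))))
print(paste("number of rt5-ful sids:", length(unique(rt5.ful$sid))))
```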

2.5 Plotting responses by rid (= respondent ids)

The following are plots of responses by participants, grouped by rt5-freeness.

2.5.1 rt5-free responses

## [1] "Sampled: TRUE; producing 14 plots"

2.5.2 rt5-ful responses

## [1] "Sampled: TRUE; producing 14 plots"

2.6 Plotting responses by sid (= sentence ids)

The following are plots of responses by stimulus, grouped by rt5-freeness. Unlike the aggregated version in Section 2.7, the responses plotted here are those of individual participants.

2.6.1 rt5-free sids

The following are plots of responses to rt5-free stimuli.

## [1] "Sampled TRUE; producing 14 plots"

2.6.2 rt5-ful sids

The following are plots of responses to rt5-ful stimuli.

## [1] "Sampled: TRUE; producing 14 plots"

## [1] "Ignored s1-001 due to insufficient responses"

2.7 Violin plots of aggregated responses

The following are plots of aggregated responses by stimulus, grouped by rt5-freeness. Responses were aggregated per stimulus by taking the median of each of rt1, rt2, ..., rt5 and RT.
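
The median aggregation can be sketched with base R's `aggregate`. This is an illustration under assumptions: the toy data frame has only two RT columns, and the conversion to a sid-by-measure matrix is a guess at how the aggregated data are stored.

```r
## Toy stand-in: several responses per stimulus (hypothetical values)
toy <- data.frame(
  sid = c("s1-001", "s1-001", "s1-001", "s1-002", "s1-002"),
  rt1 = c(0.30, 0.35, 0.90, 0.40, 0.44),
  RT  = c(2.0, 2.4, 5.0, 1.5, 1.7)
)
rt.cols <- c("rt1", "RT")

## One row per sid, holding the median of each RT column
agg <- aggregate(toy[rt.cols], by = list(sid = toy$sid),
                 FUN = median, na.rm = TRUE)

## ASSUMPTION: downstream steps use a numeric matrix with sids as row names
agg.mat <- as.matrix(agg[rt.cols])
rownames(agg.mat) <- agg$sid
agg.mat
```

Medians are less sensitive than means to the occasional very slow response, such as the 0.90 and 5.0 values for s1-001 in the toy data.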

2.7.1 rt5-free stimuli

The following are plots of aggregated responses by rt5-free stimuli.

## Loading required package: plotrix
## [1] "Sampled: TRUE; producing 14 plots"

2.7.2 rt5-ful stimuli

The following are plots of aggregated responses by rt5-ful stimuli.

## [1] "Sampled: TRUE; producing 14 plots"
## [1] "Skipped s1-001 due to insufficient responses"

3 Analysis of raw responses

We apply unsupervised clustering and PCA to the raw responses, differentiated by rt5-freeness.
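
The general pattern of clustering rows and then inspecting them through PCA can be sketched in base R. Note the substitution: the sketch uses `kmeans` with a hand-picked k as a stand-in for X-means, which selects the number of clusters itself, and `prcomp` for the PCA. The simulated matrix `x` is hypothetical and the code is not the one used for the results in this section.

```r
set.seed(42)
## Hypothetical data: two groups of responses over five RT measures
x <- rbind(
  matrix(rnorm(50 * 5, mean = 1.0, sd = 0.2), ncol = 5),
  matrix(rnorm(50 * 5, mean = 3.0, sd = 0.2), ncol = 5)
)
colnames(x) <- c("rt1", "rt2", "rt3", "rt4", "RT")

## Stand-in clustering: k-means with k fixed by hand
## (X-means would choose k on its own)
km <- kmeans(x, centers = 2, nstart = 10)

## PCA on centred and scaled columns
pca <- prcomp(x, center = TRUE, scale. = TRUE)

## First two principal components, coloured by cluster
plot(pca$x[, 1:2], col = km$cluster, pch = 19,
     main = "PCA of toy responses, coloured by cluster")
```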

3.1 Clustering rt5-free rows

## Loading required package: clusternor
## Loading required package: Rcpp
## List of 7
##  $ nrow   : num 4632
##  $ ncol   : num 5
##  $ iters  : num 20
##  $ k      : num 4
##  $ centers: num [1:4, 1:5] 1.656 0.635 0.529 0.56 0.898 ...
##  $ cluster: int [1:4632] 4 4 4 4 3 4 4 3 4 3 ...
##  $ size   : int [1:4] 59 654 1358 2561

3.2 PCA of raw rt5-free responses

## Loading required package: MASS

3.3 Clustering rt5-ful rows

## List of 7
##  $ nrow   : num 268
##  $ ncol   : num 6
##  $ iters  : num 20
##  $ k      : num 3
##  $ centers: num [1:3, 1:6] 2.446 0.478 0.527 0.853 0.564 ...
##  $ cluster: int [1:268] 2 3 2 2 3 1 1 2 2 3 ...
##  $ size   : int [1:3] 32 126 110

3.4 PCA of raw rt5-ful responses

4 Analysis of aggregated responses

Multivariate analysis of the raw responses is not revealing. What we want to know are the properties of the stimuli, which are distributed over the raw responses and are therefore latent at best. We therefore apply unsupervised clustering and PCA to the aggregated responses, again differentiated by rt5-freeness.

4.1 rt5-free aggregated responses

##  num [1:446, 1:5] 0.333 0.417 0.401 0.417 0.37 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:446] "s1-001" "s1-002" "s1-003" "s1-004" ...
##   ..$ : chr [1:5] "rt1" "rt2" "rt3" "rt4" ...

4.1.1 Clustering rt5-free aggregated responses

We now cluster rt5-free aggregated responses. The result is the following.

## List of 7
##  $ nrow   : num 446
##  $ ncol   : num 5
##  $ iters  : num 20
##  $ k      : num 4
##  $ centers: num [1:4, 1:5] 0.462 0.448 0.47 0.437 0.469 ...
##  $ cluster: int [1:446] 1 4 4 3 2 4 1 3 4 2 ...
##  $ size   : int [1:4] 107 38 153 148

4.1.2 Assign clusters to rt5-free data

4.1.3 PCA of rt5-free aggregated responses

We now plot PCA of rt5-free aggregated responses.

4.2 Remove outlier sids

It turned out that a few outlier sids, listed below, should be removed to get a better understanding of the data under scrutiny.

## [1] "Removing outlier sids:"
## [1] "s1-069" "s1-133" "s1-022" "s1-004"
##  num [1:442, 1:5] 0.333 0.417 0.401 0.37 0.461 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:442] "s1-001" "s1-002" "s1-003" "s1-005" ...
##   ..$ : chr [1:5] "rt1" "rt2" "rt3" "rt4" ...
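
Removing the listed sids from a matrix whose row names are sids can be sketched as follows. The toy matrix `agg` and its values are hypothetical; only the four sid labels come from the output above.

```r
## Toy aggregated matrix: rows are sids, columns are RT measures
## (hypothetical values)
agg <- matrix(c(0.33, 0.42, 0.40, 0.42, 0.37,
                1.10, 1.20, 1.00, 4.00, 1.10),
              ncol = 2,
              dimnames = list(c("s1-001", "s1-002", "s1-003",
                                "s1-004", "s1-005"),
                              c("rt1", "rt2")))

outlier.sids <- c("s1-069", "s1-133", "s1-022", "s1-004")  # sids listed above

## Drop rows whose sid is in the outlier list; drop = FALSE keeps a matrix
agg.clean <- agg[!rownames(agg) %in% outlier.sids, , drop = FALSE]
str(agg.clean)
```

Sids in the list that do not occur in the matrix are ignored, so the same list can be reused across subsets of the data.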

4.2.1 Clustering cleaned aggregated rt5-free responses

## List of 7
##  $ nrow   : num 442
##  $ ncol   : num 5
##  $ iters  : num 20
##  $ k      : num 4
##  $ centers: num [1:4, 1:5] 0.469 0.437 0.462 0.448 0.461 ...
##  $ cluster: int [1:442] 3 2 2 4 2 3 1 2 4 3 ...
##  $ size   : int [1:4] 149 148 107 38

4.2.2 PCA of cleaned rt5-free aggregated responses

We now plot PCA of rt5-free aggregated responses without outliers.

4.3 rt5-ful aggregated responses

##  num [1:32, 1:6] 0.217 0.339 0.871 3.953 0.378 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:32] "s1-001" "s1-009" "s1-018" "s1-019" ...
##   ..$ : chr [1:6] "rt1" "rt2" "rt3" "rt4" ...

4.3.1 Clustering aggregated rt5-ful responses

We now cluster the rt5-ful aggregated responses using FuzzyCMeans, because X-means does not work on them.

## List of 7
##  $ nrow   : num 32
##  $ ncol   : num 6
##  $ iters  : num 87
##  $ k      : num 6
##  $ centers: num [1:6, 1:6] 0.217 0.389 3.94 0.623 0.43 ...
##  $ cluster: int [1:32] 1 2 2 3 5 4 5 6 5 5 ...
##  $ size   : int [1:6] 1 4 1 4 14 8
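
For readers unfamiliar with fuzzy c-means, the following base R sketch implements the textbook algorithm: alternating updates of cluster centres and soft memberships, with fuzzifier m = 2. It is not the FuzzyCMeans routine that produced the output above; the function name `fuzzy.cmeans`, its arguments and the simulated data are all hypothetical.

```r
## Textbook fuzzy c-means (sketch; not the routine used in this report)
fuzzy.cmeans <- function(x, k, m = 2, iter.max = 100, tol = 1e-6) {
  x <- as.matrix(x)
  n <- nrow(x)
  ## random initial memberships; each row sums to 1
  u <- matrix(runif(n * k), n, k)
  u <- u / rowSums(u)
  for (it in seq_len(iter.max)) {
    w <- u^m
    ## centres: membership-weighted means (k x p matrix)
    centers <- (t(w) %*% x) / colSums(w)
    ## squared Euclidean distance of every row to every centre (n x k)
    d2 <- sapply(seq_len(k),
                 function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    d2 <- pmax(d2, .Machine$double.eps)   # guard against division by zero
    ## membership update: u_ij proportional to d2_ij^(-1 / (m - 1))
    inv <- d2^(-1 / (m - 1))
    u.new <- inv / rowSums(inv)
    delta <- max(abs(u.new - u))
    u <- u.new
    if (delta < tol) break
  }
  list(centers = centers, membership = u,
       cluster = max.col(u, ties.method = "first"), iters = it)
}

set.seed(7)
## Hypothetical data: 32 rows in two well-separated groups
x <- rbind(matrix(rnorm(16 * 2, mean = 0, sd = 0.1), ncol = 2),
           matrix(rnorm(16 * 2, mean = 5, sd = 0.1), ncol = 2))
fit <- fuzzy.cmeans(x, k = 2)
table(fit$cluster)
```

Unlike k-means, each row receives a degree of membership in every cluster; the hard labels in `fit$cluster` simply pick the largest membership per row.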

4.3.2 Assign clusters to rt5-ful data

4.3.3 PCA of aggregated rt5-ful responses

We now plot the PCA of rt5-ful aggregated responses.

4.4 Clusterwise violin plots of aggregated responses

We end this report with clusterwise plots of aggregated responses.

4.4.1 rt5-free

We first plot rt5-free aggregated responses clusterwise.

## [1] "Sampled: TRUE; procuding 14 plots for cluster 1"

## [1] "Sampled: TRUE; procuding 14 plots for cluster 2"

## [1] "Sampled: TRUE; procuding 14 plots for cluster 3"

## [1] "Sampled: TRUE; procuding 14 plots for cluster 4"

4.4.2 rt5-ful

We then plot rt5-ful aggregated responses clusterwise.

## [1] "Skipped due to insufficinet number of responses"

5 Discussion and conclusion

The responses used in the current analysis are far from representative: they are too few and not varied enough. To make our results more robust and reliable, it is vital to reach a larger pool of participants, ideally with more varied backgrounds. Doing so, however, requires more funding than we currently have.